Plan for today

  • A gentle introduction to calibration/inverse problem
    • Finding \(x = f^{-1}(y)\)
  • A live example of calibrating an Agent Based model of Ebola
    • Surrogate models as emulators
    • Considerations for stochastic simulators
    • Multi-variate output
  • Results from a real application




Slides

https://fadikar.com/DMDU-talk/

Calibration

%%{init: {'flowchart': {'curve': 'linear'}, 'theme': 'base', 'themeVariables': { 'fontSize': '18px', 'fontFamily': 'arial'}, 'flowchart': {'htmlLabels': true, 'useMaxWidth': true}}}%%
graph TD
    A["Input Parameters"] -->|Feed to| B["Simulator"]
    B -->|Produces| C["Simulated Output"]
    D["Observations"]
    D -->|Compare with| E["Calibration Algorithm"]
    C -->|Input to| E
    E -->|Returns| F["Estimated Input Parameters"]
    F -.->|Feedback Loop| A
    
    linkStyle 0,1,2,3,4,5 stroke:#333,stroke-width:2px,color:#FF1493
    

    style A fill:#e1f5ff
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#e8f5e9
    style E fill:#fce4ec
    style F fill:#e1f5ff

Calibration with emulator

Emulator Training

%%{init: {'fontSize': '20px'}, 'flowchart': {'htmlLabels': true, 'useMaxWidth': true}}%%
graph TD
    A["Input Parameters"]
    A -->|Feed to| C["Simulator"]
    C -->|Produces| E["Simulated Output"]
    E -->|Train| B["Emulator"]
    
    linkStyle 0,1,2 stroke:#333,stroke-width:2px,color:#FF1493

    style A fill:#e1f5ff
    style B fill:#fff3e0
    style C fill:#fff3e0
    style E fill:#f3e5f5

Calibration

%%{init: {'fontSize': '20px'}}%%
graph TD
    A["Input Parameters"]
    A -->|Feed to| B["Emulator"]
    B -->|Produces| D["Emulated Output"]
    F["Observations"] -->|Compare with| G["Calibration Algorithm"]
    D -->|Input to| G
    G -->|Returns| H["Estimated Input Parameters"]
    H -.->|Feedback Loop| A
    
    linkStyle 0,1,2,3,4,5 stroke:#333,stroke-width:2px,color:#FF1493

    style A fill:#e1f5ff
    style B fill:#fff3e0
    style D fill:#f3e5f5
    style F fill:#e8f5e9
    style G fill:#fce4ec
    style H fill:#e1f5ff

Example: Ebola model

  • An agent-based model for Ebola in Liberia.
  • Depends on a highly detailed synthetic population with realistic contact networks and demographic characteristics.
  • 5 dimensional inputs
    • transmission rate (\(\theta_1\))
    • initial infections (\(\theta_2\))
    • hospital intervention delay (\(\theta_3\))
    • hospital intervention efficacy (\(\theta_4\))
    • travel intervention (\(\theta_5\))
  • Time-series output of daily infections
  • Stochastic output

The Calibration Challenge

We were given:

  • A ground truth time series of observed Ebola cases.
  • A set of simulation runs pre specified parameter values.

Goal:

  • To find calibrated model which can be used to make forward predictions and run scenario analyses.

\[ \text{Find} \;\;\; \Theta^* = \left\{(\theta_1, \cdots, \theta_5): f^{\text{sim}}(\theta_1, \cdots, \theta_5)\approx \left(Y_1^{(obs)}, \cdots, Y_t^{(obs)}\right)\right\} \]

Simulation Campaign

Simulation Input

Simulation Output



Build an Emulator

  • We don’t have access to the (expnesive) Ebola ABM
  • Even if we had, we wouldn’t be able to run arbitrarily large number of simulations due to budget constraints

What is an emulator?

  • An approximation of the actual simulator.
  • Cheap to evaluate, often limited in functionality.
  • Desirable to have meaningful statistical properties.

Gaussian Process as an Emulator

  • A non-parametric Bayesian approach to modeling functions
  • Defines a distribution over functions, not just parameters
  • Any finite collection of function values follows a multivariate Gaussian distribution

Note

\[z(x) \sim \mathcal{GP}(m(x), k(x, x'))\]

where \(m(x)\) is the mean function and \(k(x, x')\) is the covariance (kernel) function

  • Core idea: Encode relationship between \(z(x)\) and \(z(x')\) based on distance between \(x\) and \(x'\)
    • If \(x\) and \(x'\) are close, then \(z(x)\) and \(z(x')\) should be similar
    • Covariance function \(k(x, x')\) captures this spatial correlation

Covariance Kernels

Samples from unconditional GP

  • Define locations \((x_1, x_2, \cdots, x_n)\).
  • Evaluate the covariance matrix \(K\) according to the covariance kernel.
  • Draw sample from \(MVN(0, K)\).



Common Covariance Kernels

  • RBF (Radial Basis Function / Squared Exponential): \(k(x, x') = \sigma^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)\)

  • Matérn: \(k(x, x') = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\|x - x'\|}{\ell}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu}\|x - x'\|}{\ell}\right)\)

  • Periodic: \(k(x, x') = \sigma^2 \exp\left(-\frac{2\sin^2(\pi|x - x'|/p)}{\ell^2}\right)\)

  • Linear: \(k(x, x') = \sigma^2 (x - c)(x' - c)\)

Covariance Kernels

Samples from unconditional GP

Covariance Kernel - RBF

Samples from unconditional GP

Covariance Kernel - RBF

  • Observe: \((X, Z)\) where \(X = \{x_1, ..., x_n\}\) and \(Z = \{z_1, ..., z_n\}\)
  • Test points: \(X_* = \{x_1^*, ..., x_m^*\}\)

Joint Distribution

The joint distribution of observed and test points is:

\[\begin{bmatrix} Z \\ Z_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu \\ \mu_* \end{bmatrix}, \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right)\]

where \(\sigma_n^2\) is the observation noise variance

Conditional Distribution

The posterior distribution \(Z_* | X, Z, X_*\) is Gaussian:

\[Z_* | X, Z, X_* \sim \mathcal{N}(\mu_{*|Z}, \Sigma_{*|Z})\]

where:

  • \(\mu_{*|Z} = \mu_* + K(X_*, X)[K(X, X) + \sigma_n^2 I]^{-1}(Z - \mu)\)

  • \(\Sigma_{*|Z} = K(X_*, X_*) - K(X_*, X)[K(X, X) + \sigma_n^2 I]^{-1}K(X, X_*)\)

Covariance Kernel - RBF

Samples from conditional GP

Covariance Kernel - RBF

Samples from conditional GP

Back to the problem

Consider a smaller problem.

  • Calibrate to the observed cumulative infections at the end of week 30.
  • Pretend that the simulation is deterministic (only use one replicate at each input).
  • Need an emulator for 30 weeks cumulative infections.

Emulator for week 30 cumulative infections

Data Preprocessing

# d is a matrix of order 100x5, where each row represents one unique combination of
# (theta_1, ..., theta_5)
# Normalize input parameters to [0, 1]
d_min <- apply(d, 2, min)
d_max <- apply(d, 2, max)
d_normalized <- sweep(sweep(d, 2, d_min, "-"), 2, d_max - d_min, "/")

# Transform output
sim30_log <- log(sim30 + 1)

y_std <- (sim30_log - mean(sim30_log)) / sd(sim30_log)
obs30_std <- (log(obs30) - mean(sim30_log)) / sd(sim30_log)

plot(y_std, ylab = "Log Cumulative infections at week 30", xlab = "parameter index")
abline(h = obs30_std, col = "red", cex = 1.2)
text(40, obs30_std-2, "observation", col = 2, cex = 1.2)

GP fitting

library(hetGP)
gp_fit <- mleHomGP(X = d_normalized, Z = y_std, known = list(beta0 = 0))
gp_fit
N =  100  n =  100  d =  5 
Gaussian  covariance lengthscale values:  2.307185 2.470611 2.124024 2.470611 2.470611 
Homoskedastic nugget value:  0.03716173 
Variance/scale hyperparameter:  1.73618 
Given constant trend value:  0 
MLE optimization: 
 Log-likelihood =  -40.87428 ; Nb of evaluations (obj, gradient) by L-BFGS-B:  39 39 ; message:  CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH 
plot(gp_fit)

Run Calibration

Plug the emulator in any optimization/inference routine of your choice to find good \((\theta_1, \cdots, \theta_5)\) candidates.

  • We use Approximate Bayesian Computation
# ABC settings
n_abc <- 100000
tolerance <- 0.2

# generate samples from prior
abc_samples <- matrix(runif(n_abc * 5), nrow = n_abc, ncol = 5)
colnames(abc_samples) <- paste0('theta_', 1:5)

pred_abc <- predict(gp_fit, abc_samples)
pred_mean_abc <- pred_abc$mean
pred_sd_abc <- sqrt(pred_abc$sd2 + pred_abc$nugs)

discrepancy <- abs(pred_mean_abc - obs30_std)

accepted <- discrepancy < tolerance
n_accepted <- sum(accepted)

cat(sprintf("ABC Results:\n"))
cat(sprintf("  Samples accepted: %d / %d (%.2f%%)\n", 
            n_accepted, n_abc, 100 * n_accepted / n_abc))

abc_accepted <- abc_samples[accepted, ]

abc_accepted_original <- sweep(abc_accepted, 2, d_max - d_min, "*")
abc_accepted_original <- sweep(abc_accepted_original, 2, d_min, "+")

par(mfrow = c(2, 3))
for (i in 1:5){
  hist(abc_accepted_original[, i], main = expression(paste("Posterior of ", theta[i])),
  xlab = expression(paste(theta[i])))
}

hist(exp(pred_mean_abc[accepted] * sd(sim30_log) + mean(sim30_log)), main = "Posterior cumulative infections", xlim = c(5000, 8000), xlab = "cumulative infections")
abline(v = obs30, col = 2, lty = 2, cex = 1.2)

Run Calibration

Plug the emulator in any optimization/inference routine of your choice to find good \((\theta_1, \cdots, \theta_5)\) candidates.

  • We use Approximate Bayesian Computation
ABC Results:
  Samples accepted: 7340 / 1000000 (0.73%)

How to deal with stochastic output

  • A natural approach is to use the mean (and variance) of the simulation output as the summary statistic for comparison with observations.
  • A GP emulator can approximate both the mean and the variance given the input.
# flatten the replicated output
sim30 <- as.numeric(sima[, , 30])
sim30_log <- log(sim30 + 1)

y_std <- (sim30_log - mean(sim30_log)) / sd(sim30_log)

# generate a replicated design matrix
d_normalized_rep <- d_normalized[rep(1:100, each = 100), ]

gp_fit <- mleHetGP(X = d_normalized_rep, Z = y_std, known = list(beta0 = 0), 
                  covtype = "Matern5_2")
gp_fit
N =  10000  n =  100  d =  5 
Matern5_2  covariance lengthscale values of the main process:  0.5774421 0.3996435 1.013529 1.003587 1.377869 
Variance/scale hyperparameter:  0.5919967 
Summary of Lambda values: 
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.003310 0.008398 0.017072 0.396408 0.188800 4.607871 
Given constant trend value:  0 
Matern5_2  covariance lengthscale values of the log-noise process:  0.5774421 0.3996435 1.013529 1.003587 1.377869 
Nugget of the log-noise process:  1e-06 
Estimated constant trend value of the log-noise process:  -1.955257 
MLE optimization: 
 Log-likelihood =  3351.833 ; Nb of evaluations (obj, gradient) by L-BFGS-B:  44 44 ; message:  CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH 

How to deal with stochastic output

  • A natural approach is to use the mean (and variance) of the simulation output as the summary statistic for comparison with observations.
  • A GP emulator can approximate both the mean and the variance given the input.
plot(gp_fit)

Issue with Gaussian assumption

  • The replicate variablity can not be sufficiently described by a mean and a variance of a Gaussian distribution.

Use Quantiles

  • Describe the replicate variabilty via empirical quantiles.
  • Model the quantiles using GP.

Quantile GP

# compute empirical quantiles
sim30_q <- apply(sima[, , 30], 2, quantile, probs = c(0.05, 0.275, 0.5, 0.725, 0.95))
sim30_q[, 1:5]
          [,1]    [,2]     [,3]     [,4]      [,5]
5%    11024.35    1.00 138572.8 10586.20  77403.65
27.5% 13514.57    3.00 155366.4 12675.42  86900.82
50%   14480.50 1299.50 167211.5 13970.00  94504.50
72.5% 15379.62 3141.25 180769.0 14909.77 103736.58
95%   17323.05 5069.15 198685.7 18056.65 117143.95
# augment the quantile input to the design matrix
d_normalized_q <- cbind(d_normalized[rep(1:100, each = 5), ], 
                        rep(c(0.05, 0.275, 0.5, 0.725, 0.95), 5))
colnames(d_normalized_q)[6] <- "alpha"
d_normalized_q[1:10, ]                       
       theta_1   theta_2 theta_3  theta_4  theta_5 alpha
 [1,] 0.535354 0.8947368   0.000 0.858586 0.727273 0.050
 [2,] 0.535354 0.8947368   0.000 0.858586 0.727273 0.275
 [3,] 0.535354 0.8947368   0.000 0.858586 0.727273 0.500
 [4,] 0.535354 0.8947368   0.000 0.858586 0.727273 0.725
 [5,] 0.535354 0.8947368   0.000 0.858586 0.727273 0.950
 [6,] 0.565657 0.0000000   0.375 0.313131 0.323232 0.050
 [7,] 0.565657 0.0000000   0.375 0.313131 0.323232 0.275
 [8,] 0.565657 0.0000000   0.375 0.313131 0.323232 0.500
 [9,] 0.565657 0.0000000   0.375 0.313131 0.323232 0.725
[10,] 0.565657 0.0000000   0.375 0.313131 0.323232 0.950

Quantile GP

  • Feed the augmented data into any standard GP module.
sim30_q_log <- log(sim30_q + 1)
sim30_q_log <- as.numeric(sim30_q_log)

y_std <- (sim30_q_log - mean(sim30_q_log)) / sd(sim30_q_log)

gp_fit <- mleHomGP(X = d_normalized_q, Z = y_std, known = list(beta0 = 0), 
                  covtype = "Matern5_2")

plot(gp_fit)

Quantile GP + ABC

ABC Results:
  Samples accepted: 922 / 100000 (0.92%)

Emulator for Multivariate Output

  • So far we have emulated scalar output - cumulative infections at week 30.
  • How can we extend GP emulator to time-series output?
    • Use time as an input to GP - not scalable
    • Use basis decomposition

\[\mathbf{f}^{\text{sim}}(\theta) = \underbrace{\phi_0}_{\text{mean}} + \sum_{k=1}^{p} \underbrace{\phi_k}_{\text{basis}} \underbrace{w_k(\theta)}_{\text{GP}} + \epsilon\]

\[\mathbf{f}(\boldsymbol{\theta}) = \begin{bmatrix} f_1(\boldsymbol{\theta}) \\ f_2(\boldsymbol{\theta}) \\ \vdots \\ f_{30}(\boldsymbol{\theta}) \end{bmatrix} = \phi_0 + \overbrace{ \begin{bmatrix} \phi_{11} & \phi_{12} & \cdots & \phi_{1p} \\ \phi_{21} & \phi_{22} & \cdots & \phi_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{30,1} & \phi_{30,2} & \cdots & \phi_{30,p} \end{bmatrix}}^{K} \begin{bmatrix} w_1(\boldsymbol{\theta}) \\ w_2(\boldsymbol{\theta}) \\ \vdots \\ w_p(\boldsymbol{\theta}) \end{bmatrix}. \]

Emulator for Multivariate Output

Use singular value decomposition

\[ \mathbf{f} = \begin{bmatrix} f_{1}(\theta_{1},\alpha_{1}) & f_{1}(\theta_{1},\alpha_{2}) & \cdots & f_{1}(\theta_{1},\alpha_{5}) & \cdots & f_{1}(\theta_{100},\alpha_{1}) & \cdots & f_{1}(\theta_{100},\alpha_{5}) \\ f_{2}(\theta_{1},\alpha_{1}) & f_{2}(\theta_{1},\alpha_{2}) & \cdots & f_{2}(\theta_{1},\alpha_{5}) & \cdots & f_{2}(\theta_{100},\alpha_{1}) & \cdots & f_{2}(\theta_{100},\alpha_{5}) \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots & \ddots & \vdots \\ f_{30}(\theta_{1},\alpha_{1}) & f_{30}(\theta_{1},\alpha_{2}) & \cdots & f_{30}(\theta_{1},\alpha_{5}) & \cdots & f_{30}(\theta_{100},\alpha_{1}) & \cdots & f_{30}(\theta_{100},\alpha_{5}) \end{bmatrix} = U_{30 \times r}\, D_{r \times r}\, V^{\top}_{r \times (500)} . \]



\(\mathbf{f} = K * W\\ K = \sqrt{500} \; U * D \\ W = \frac{1}{\sqrt{500}} V'\)

Emulator for Multivariate Output

# arrange the quantiles in (30 x 500) matrix
sim30_t <- log(sima[, , 1:30] + 1)

sim30_t_q <- apply(sim30_t, c(2, 3), quantile, probs = c(0.05, 0.275, 0.5, 0.725, 0.95))
sim30_t_q <- aperm(sim30_t_q, c(3, 2, 1))

Y_mat <- t(vapply(seq_len(30), function(w) {
  as.vector(t(sim30_t_q[w, , ]))  # row-wise flatten: point-major, quant-within-point
}, numeric(100 * 5)))


## normalize
Y_mean <- apply(Y_mat, 1, mean)
Y_sd <- sd(Y_mat)

Y_mat_std <- (Y_mat - matrix(Y_mean, nrow = 30, ncol = 500)) / Y_sd


## SVD
udv <- svd(Y_mat_std)

K <- udv$u %*% diag(udv$d) * sqrt(ncol(Y_mat_std))

par(mfrow = c(2, 1))
plot(cumsum(udv$d / sum(udv$d)), type = "b", xlab = "no of component (p)", ylab = "cumulative sum")
abline(h = c(0.95, 0.99))


## check where the cumulative eigen value crosses .95
nK <- which(cumsum(udv$d / sum(udv$d)) > .95)[1] - 1

K <- K[, 1:nK]
matplot(K, type = "l", xlab = "weeks", ylab = expression(paste(phi)))

Emulator for Multivariate Output

GP to model \(W_i(\theta)\)

W <- udv$v[, 1:nK] / sqrt(ncol(Y_mat_std))

str(W)
 num [1:500, 1:4] -0.00071 -0.00104 -0.00119 -0.00133 -0.00152 ...
## fit GP to each column of W
gps <- list()
for (ii in 1:nK){
  gps[[ii]] <- mleHomGP(d_normalized_q, W[, ii], known = list(beta0 = 0), covtype = "Matern5_2")
}

Final Emulator

Prediction at an untried input \((\theta_1^*, \cdots, \theta_5^*, \alpha^*)\)

Full Bayesian Inference

  • Explore the posterior distribution via MCMC (can be expensive in some situations)
  • Posterior is obtained from gaussian likelihood, and suitable priors for the parameters, e.g. gamma priors for error precisions, sparcity enforced prior for GP covariance parameters etc.

  • Estimated posterior distribution of the parameters \((\theta_1, \cdots, \theta_5, \alpha)\)
  • The diagonal shows the estimated marginal posterior pdf for each parameter
  • The off-diagonal images give estimates of bivariate marginals; to contour lines show estimated 90% hpd regions.

Summary

  • Gaussian Process emulator.

  • Stochastic simulations.

  • Multi-variate output